Data Prep for k Value Analysis

Prepare the risk variable data for clustering by:

  1. Converting categorical variables to factors for proper handling

  2. Removing non-clustering variables (demographics, IDs) while preserving participant IDs as row names for later identification

  3. Separating continuous and categorical variables for differential processing

  4. Defining a flexible scaling function that supports multiple normalization methods (z-score, min-max, percentile, max-absolute, and robust scaling), with the method chosen according to which performed best in earlier testing steps

  5. Applying the specified scaling method (here, z-score scaling) to standardize continuous variables while leaving categorical variables unchanged
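
As a quick illustration of the robust option in step 4, scaling a toy vector by its median and IQR keeps an outlier from distorting either the center or the spread estimate (values here are illustrative only):

```r
# Toy vector with one outlier; robust scaling uses median/IQR,
# so the outlier inflates neither location nor scale
x <- c(1, 2, 3, 4, 100)
robust_scaled <- (x - median(x)) / IQR(x)
robust_scaled  # -1.0 -0.5 0.0 0.5 48.5
```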

## Data Prep ## 

#1. Ensure categorical variables are encoded as factors for clustering
risk_variable_data <- risk_variable_data %>% 
  mutate(across(c("family_history_depression", "family_history_mania", "bullying"), as.factor))

#2. Retain subject IDs as row names while removing variables that will not be clustered 
#2.1 Remove all variables except subject ID from the dataframe for clustering
risk_variable_clustering_data <- risk_variable_data %>%
  dplyr::select(-session_id, -family_id, -site, -sex, -age, -race_ethnicity)

#2.2 Set subject IDs as row names
row.names(risk_variable_clustering_data) <- risk_variable_clustering_data$participant_id

#2.3 Remove subject ID column
risk_variable_clustering_data <- risk_variable_clustering_data %>% 
  dplyr::select(-participant_id)

#3. Identify continuous and categorical variables
continuous_vars <- names(risk_variable_clustering_data)[sapply(risk_variable_clustering_data, is.numeric)]
categorical_vars <- names(risk_variable_clustering_data)[sapply(risk_variable_clustering_data, is.factor)]

#4. Scaling function dispatched on the scaling-method parameter
apply_scaling <- function(df, method) {
  out <- df
  if (method != "none") {
    cont <- df[continuous_vars]
    scaled <- switch(method,
      z_score = as.data.frame(scale(cont)),
      min_max = purrr::map_df(cont, ~ {
        rng <- range(.x, na.rm = TRUE)
        # guard against constant columns (zero range) to avoid division by zero
        if (diff(rng) == 0) .x * 0 else (.x - rng[1]) / diff(rng)
      }),
      percentile = purrr::map_df(cont, ~ rank(.x, na.last = "keep") / sum(!is.na(.x)) * 100),
      max_absolute = purrr::map_df(cont, ~ .x / max(abs(.x), na.rm = TRUE)),
      robust = purrr::map_df(cont, ~ {
        # fall back to 1 when the IQR is 0 (a numeric 0 never triggers `%||%`)
        iqr <- IQR(.x, na.rm = TRUE)
        (.x - median(.x, na.rm = TRUE)) / if (iqr == 0) 1 else iqr
      }),
      stop("Unknown scaling method: ", method)
    )
    out[continuous_vars] <- scaled
  }
  out
}

#5. Scale data according to param set in script call
scaled_data <- apply_scaling(risk_variable_clustering_data, params$scaling_method)

Overview of Determining Optimal Number of Clusters (k Value)

Choosing the appropriate number of clusters (k) is critical for uncovering meaningful groupings in mixed-type data. In this analysis, we combine a descriptive elbow analysis with (ideally multiple) internal validation indices to arrive at a robust, transparent recommendation for k between 2 and 8:

  1. Elbow method: We plot the within-cluster sum of squares (WSS) across k = 2:8 and apply the Kneedle algorithm [Satopaa et al., 2011] via the inflection package [Christopoulos, 2016] to quantitatively identify the elbow point. This provides a visual and numeric guide, but is treated as descriptive rather than prescriptive for the purposes of our analyses.

  2. Internal validation indices: Using validation_kproto(), we generate the silhouette validation index, and fold in any of the other relevant validation indices (C-index, Dunn, Gamma, point-biserial, and Tau) that completed within the maximum allowed 7-day run time for each k.

  3. Consensus evaluation: We record the optimal k returned by each index and examine patterns of agreement or discrepancy. Convergence of multiple indices on the same k strengthens our confidence in that choice, though a nuanced examination of the index scores, their distributions, and the elbow method will be consulted to inform the final decision.
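
The elbow logic in step 1 can be sketched with a toy WSS curve. The real analysis uses uik() from the inflection package; the sketch below only illustrates the geometric idea behind Kneedle (pick the point farthest below the chord joining the curve's endpoints), with made-up WSS values:

```r
# Toy WSS curve with a clear elbow at k = 3 (illustration only)
k_grid <- 2:8
wss <- c(100, 60, 50, 45, 42, 40, 39)

# Kneedle-style heuristic: distance below the chord from first to last point
chord <- wss[1] + (wss[length(wss)] - wss[1]) *
  (k_grid - k_grid[1]) / (k_grid[length(k_grid)] - k_grid[1])
elbow_k <- k_grid[which.max(chord - wss)]
elbow_k  # 3
```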

Implementation: parallel k-prototypes + single-index validation

In this step we:

  1. Compute k-prototypes clusterings for k = 2:8 once and cache the results

  2. Extract within-cluster sum of squares (WSS) for elbow diagnostics

  3. Run the silhouette index with one SLURM task per k value

  4. Log start/end times, WSS values, and each index's optimal k for reproducibility

Checkpoint: aggregate diagnostics & prepare for consensus

The silhouette-index-specific job has now finished and written its results to disk.

Next, we will:

  1. Read and combine all WSS into an elbow plot and run Kneedle

  2. Merge each available index's full k-vs-score trajectories

  3. Compute each available index's preferred k and select a consensus k

  4. Visualize results via line plots and heatmaps for final inspection
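
The simple-vote rule behind step 3 fits in a few lines; here it is with hypothetical per-index results (these k values are illustrative, not our findings):

```r
# Hypothetical per-index optimal k values (illustration only)
k_opt_by_index <- c(silhouette = 3, cindex = 3, gamma = 4, ptbiserial = 3, tau = 2)

# Simple-vote consensus: the k preferred by the most indices
votes <- table(k_opt_by_index)
consensus_k <- as.integer(names(votes)[which.max(votes)])
consensus_k  # 3
```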

## Determining Optimal Number of Clusters (k Value) ##

#1. Generate clustering solutions & validation indices of interest
#1.1 Establish relevant parameters & paths
#1.1.1 Create a vector containing the k's to test
k_vals <- 2:8

#1.1.2 List where to stash / load the kproto list for this method
proto_file <- file.path(kproto_dir, sprintf("kproto_%s.rds", scaling_method))

#1.2 Load or compute all k-prototypes clusters (unchanged)
if (file.exists(proto_file)) {
  
  #1.2.1 Skip clustering if already done
  kp_list <- readRDS(proto_file)
  cat(
    sprintf("%s|PROTO_SKIP|%s|loaded %d ks from cache\n", format(Sys.time(), "%Y-%m-%d %H:%M:%OS3"),
            scaling_method, length(k_vals)),
    file = log_file, append = TRUE
  )
  
} else {
  
  #1.2.2 Compute clusters in parallel across k if not completed
  proto_start <- Sys.time()
  cat(
    sprintf("%s|PROTO_START|%s\n",
            format(proto_start, "%Y-%m-%d %H:%M:%OS3"),
            scaling_method),
    file = log_file, append = TRUE
  )
  
  #1.2.3 Store a list containing each kp object (parallel over 7 k's)
  kp_list <- future_map(
    k_vals,
    ~ kproto(x = scaled_data,
             k = .x,
             nstart  = 8,
             verbose = TRUE),
    .options = furrr_options(seed = TRUE)
  )
  
  #1.2.4 Name each kproto result with a "k" prefix (k2 ... k8)
  names(kp_list) <- paste0("k", k_vals)
  
  #1.2.5 Save the kproto list
  saveRDS(kp_list, proto_file)
  
  #1.2.6 Log end status for clustering
  proto_end <- Sys.time()
  cat(
    sprintf("%s|PROTO_END|%s|duration=%.1fmin\n", format(proto_end, "%Y-%m-%d %H:%M:%OS3"),
            scaling_method, as.numeric(difftime(proto_end, proto_start, units = "mins"))),
    file = log_file, append = TRUE
  )
}

#1.2.7 Extract total within-cluster sum of squares (WSS) for elbow plot
wss_vec <- map_dbl(kp_list, "tot.withinss")
names(wss_vec) <- k_vals

#1.2.8 Log WSS for each k
for (j in seq_along(k_vals)) {
  cat(
    sprintf("%s|WSS|%s|k=%d|wss=%.1f\n", format(Sys.time(), "%Y-%m-%d %H:%M:%OS3"),
            scaling_method,  k_vals[j], wss_vec[j]),
    file = log_file, append = TRUE
  )
}

#1.3 Read in all available validation objects (silhouette always there; others if finished)
#1.3.1 Find every val_<method>_<index>.rds for this scaling method
val_files <- list.files(
  validation_dir,
  pattern = sprintf("^val_%s_.*\\.rds$", scaling_method),
  full.names = TRUE
)

#1.3.2 Extract the validation index names from the filenames
indices <- basename(val_files) %>%
  stringr::str_remove(sprintf("^val_%s_", scaling_method)) %>%
  stringr::str_remove("\\.rds$")

#1.3.3 Read them all in and name the list
val_list <- purrr::map(val_files, readRDS)
names(val_list) <- indices


#2. Generate elbow diagnostics & Kneedle (UIK)  
#2.1 Prepare a data.frame for plotting
elbow_df <- tibble(
  k = as.integer(names(wss_vec)),
  wss = as.numeric(wss_vec)
)

#2.2 Create first basic elbow plot
#2.2.1 Generate basic plot
elbow_plot <- ggplot(elbow_df, aes(x = k, y = wss)) +
  geom_line(size = 1) +
  geom_point(size = 3) +
  labs(
    x = "Number of clusters (k)",
    y = "Total within cluster sum of squares",
    title = "Elbow plot: WSS vs k") +
  theme_minimal(base_size = 14)

#2.2.2 Print the basic elbow plot
print(elbow_plot)

#2.3 Find the knee via the Kneedle (UIK) algorithm
knee <- uik(elbow_df$k, elbow_df$wss)

#2.3.1 Log the UIK elbow into the detailed log
cat(
  sprintf(
    "%s|UIK_ELBOW|%s|k=%d\n", format(Sys.time(), "%Y-%m-%d %H:%M:%OS3"),
    scaling_method, knee),
  file = log_file, append = TRUE)

#2.4 Create a more detailed elbow plot with UIK annotation
#2.4.1 Generate the elbow plot with the UIK
elbow_plot_uik <- ggplot(elbow_df, aes(x = k, y = wss)) +
  geom_line(size = 1) +
  geom_point(size = 3) +
  geom_vline(
    xintercept = knee,
    linetype = "dashed",
    size = 0.8,
    color = "red") +
  annotate(
    "text",
    x = knee,
    y = max(elbow_df$wss) * 0.95,
    label = paste0("UIK elbow:\nk = ", knee),
    hjust = 0,
    size = 5,
    color = "red") +
  labs(
    x = "Number of clusters (k)",
    y = "Total WSS",
    title = "Elbow plot with Kneedle (UIK) selection") +
  theme_minimal(base_size = 14)

#2.4.2 Print the elbow plot with the UIK
print(elbow_plot_uik)

#2.5 Cluster compactness / separation metrics per k
#2.5.1 Compute mixed type total sum of squares (TSS) once via 1-cluster prototype
#2.5.1.1 Calculate the kproto object
t1 <- kproto(
  x = scaled_data,
  k = 1,
  nstart  = 1,
  verbose = FALSE)

#2.5.1.2 Store the total within SS value for the whole mixed dataset
TSS_mixed <- t1$tot.withinss

#2.5.2 Loop over each k to compute RMSSD & R2
cluster_metrics <- purrr::map_dfr(
  k_vals,
  function(k) {
    
    #2.5.2.1 Extract the k-prototype fit for this k by position
    position <- which(k_vals == k)
    kp <- kp_list[[position]]
    wss_cl <- kp$withinss
    sz <- as.numeric(kp$size)
    tot_wss <- kp$tot.withinss
    
    #2.5.2.2 RMSSD: sqrt(mean( SS_cl / cluster_size ))
    rmssd <- sqrt(mean(wss_cl / sz))
    
    #2.5.2.3 R2: proportion of variance explained
    r2 <- (TSS_mixed - tot_wss) / TSS_mixed
    
    #2.5.2.4 Log to detailed log
    cat(
      sprintf(
        "%s|METRICS|%s|k=%d|RMSSD=%.4f|R2=%.4f\n", format(Sys.time(), "%Y-%m-%d %H:%M:%OS3"),
        scaling_method, k, rmssd, r2), file = log_file, append = TRUE)
    
    #2.5.2.5 Return a one-row tibble with our metrics
    tibble(
      k = k,
      RMSSD = rmssd,
      R2 = r2)
  }
)

#2.5.3 Print a summary table
knitr::kable(
  cluster_metrics,
  caption = glue::glue("Compactness/separation metrics per k ({scaling_method} scaling)"),
  digits = 3) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Compactness/separation metrics per k (z_score scaling)

| k | RMSSD | R2    |
|---|-------|-------|
| 2 | 4.309 | 0.164 |
| 3 | 4.015 | 0.233 |
| 4 | 3.837 | 0.278 |
| 5 | 3.853 | 0.308 |
| 6 | 3.715 | 0.333 |
| 7 | 3.619 | 0.352 |
| 8 | 3.539 | 0.368 |
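
The two metrics follow the formulas in the code comments above; a minimal worked example with hypothetical per-cluster quantities (not our data) shows how they combine:

```r
# Hypothetical per-cluster quantities for a k = 2 solution (illustration only)
wss_cl <- c(120, 80)  # within-cluster SS for each cluster
sz <- c(60, 40)       # cluster sizes
TSS_mixed <- 250      # total SS of the whole (mixed-type) dataset

rmssd <- sqrt(mean(wss_cl / sz))             # within-cluster spread per observation
r2 <- (TSS_mixed - sum(wss_cl)) / TSS_mixed  # proportion of variance explained
c(RMSSD = rmssd, R2 = r2)
```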
#2.6 UMAP & blob diagnostics for selected k values
#2.6.1 Create a function to one hot encode categorical variables + pipe in scaled numerics
prepare_mixed_data <- function(data) {
  cat_cols <- sapply(data, function(x) is.factor(x) || is.character(x))
  num_cols <- sapply(data, is.numeric)
  comps <- list()
  
  if (any(cat_cols)) {
    cd <- data[, cat_cols, drop = FALSE]
    cd[] <- lapply(cd, function(x) if (is.character(x)) as.factor(x) else x)
    comps$cat <- dummy_cols(cd, remove_first_dummy = TRUE, remove_selected_columns = TRUE)
  }
  if (any(num_cols)) {
    nd <- data[, num_cols, drop = FALSE]
    comps$num <- as.data.frame(nd)
  }
  do.call(cbind, comps)
}
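
For intuition, the dummy coding performed by dummy_cols(..., remove_first_dummy = TRUE) can be reproduced in base R with model.matrix(), which likewise drops one reference level per factor:

```r
# Base-R analogue of one-hot encoding with the first level dropped
f <- factor(c("a", "b", "c", "a"))
dummies <- model.matrix(~ f)[, -1, drop = FALSE]  # drop the intercept column
dummies  # columns fb and fc; row 2 has fb = 1, row 3 has fc = 1
```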

#2.6.2 Choose the k's to visualize - current most likely optimal candidates
selected_k <- c(2, 3, 4)

#2.6.3 Loop over each selected k and create UMAP + diagnostics
for (k in selected_k) {
  
  #2.6.3.1 Extract data & cluster labels by position
  pos <- which(k_vals == k)
  kp_obj <- kp_list[[pos]]
  data_k <- kp_obj$data
  cluster_labels <- factor(kp_obj$cluster)
  
  #2.6.3.2 Prepare data_for_umap and run UMAP with tuned params
  data_for_umap <- prepare_mixed_data(data_k)
  umap_config <- umap.defaults
  umap_config$n_neighbors <- 15
  umap_config$min_dist <- 0.1
  umap_config$spread <- 1.5
  umap_config$n_epochs <- 500
  
  #2.6.3.3 Generate the UMAP results
  umap_result <- umap(data_for_umap, config = umap_config)
  
  #2.6.3.4 Build a plotting dataframe
  umap_df <- data.frame(
    UMAP1 = umap_result$layout[,1],
    UMAP2 = umap_result$layout[,2],
    Cluster = as.factor(cluster_labels)
  )
  
  #2.6.3.5 UMAP scatter plot
  p_umap <- ggplot(umap_df, aes(x = UMAP1, y = UMAP2, color = Cluster)) +
    geom_point(size = 1.5, alpha = 0.6) +
    labs(
      title = glue::glue("UMAP Visualization of Clusters (k = {k})"),
      subtitle = paste("n_neighbors =", umap_config$n_neighbors, ", min_dist =", umap_config$min_dist),
      x = "UMAP Dimension 1",
      y = "UMAP Dimension 2") +
    theme_minimal() +
    theme(
      legend.title = element_text(size = 12),
      plot.title = element_text(size = 14, hjust = 0.5),
      plot.subtitle = element_text(size = 10, hjust = 0.5)) +
    guides(color = guide_legend(override.aes = list(alpha = 1, size = 3)))
  print(p_umap)
  
  #2.6.3.6 UMAP blob analysis plot
  p_blob_analysis <- ggplot(umap_df, aes(x = UMAP1, y = UMAP2)) +
    geom_density_2d(alpha = 0.3, color = "gray50") +
    geom_point(aes(color = Cluster), size = 1.2, alpha = 0.7) +
    stat_summary_2d(aes(z = as.numeric(Cluster)), fun = mean, alpha = 0.7, bins = 10) +
    labs(
      title = glue::glue("UMAP Blob Analysis (k = {k})"),
      subtitle = "Even overlapping clusters can show meaningful structure",
      x = "UMAP Dimension 1",
      y = "UMAP Dimension 2") +
    theme_minimal() +
    theme(legend.title = element_text(size = 12))
  print(p_blob_analysis)
  
  #2.6.3.7 Individual cluster highlight plots
  unique_clusters <- sort(unique(cluster_labels))
  cluster_plots <- lapply(unique_clusters, function(cl_id) {
    plot_data <- umap_df
    plot_data$highlight <- ifelse(plot_data$Cluster == cl_id, paste("Cluster", cl_id), "Other")
    
    #2.6.3.8 Plot the cluster highlights (named scale values keep the target cluster red;
    #        unnamed values would be assigned alphabetically, coloring "Cluster X" gray)
    ggplot(plot_data, aes(x = UMAP1, y = UMAP2)) +
      geom_point(aes(color = highlight, alpha = highlight), size = 1) +
      scale_color_manual(values = setNames(c("red", "gray80"), c(paste("Cluster", cl_id), "Other"))) +
      scale_alpha_manual(values = setNames(c(0.8, 0.3), c(paste("Cluster", cl_id), "Other"))) +
      labs(title = paste("Cluster", cl_id, "Distribution")) +
      theme_minimal() +
      theme(
        legend.position = "none",
        plot.title = element_text(size = 10),
        axis.text = element_text(size = 8)
      )
  })
  
  #2.6.3.9 Arrange up to 6 plots in a 2-column grid; otherwise print individually
  if (length(cluster_plots) <= 6) {
    do.call(grid.arrange, c(cluster_plots, ncol = 2, top = glue::glue("k = {k} Individual Cluster Patterns")))
  } else {
    print(cluster_plots)
  }
}

#3. Summarize & visualize validation indices & pick a consensus k 
#3.1 Define which direction is better for each index
dir_tbl <- tribble(
  ~index, ~direction, ~label,
  "silhouette", "higher", "Silhouette (higher = better)",
  "cindex", "lower", "C-index (lower = better)",
  "dunn", "higher", "Dunn (higher = better)",
  "gamma", "higher", "Gamma (higher = better)",
  "ptbiserial", "higher", "Point-biserial (higher = better)",
  "tau", "higher", "Tau (higher = better)"
)

#3.2 Pull out full index vs k values from cached validation objects
#3.2.1 Build a single data.frame with index, k, and value
index_k_df <- purrr::map2_dfr(
  indices, val_list,
  function(idx, vr) {
    ks <- as.integer(names(vr$indices))
    vals <- as.numeric(vr$indices)
    tibble(index = idx, k = ks, value = vals)
  }
)

#3.2.2 Now filter dir_tbl to only the indices we actually have for the current run
dir_tbl2 <- dir_tbl %>% filter(index %in% index_k_df$index)

#3.2.3 Join direction & label onto the data
index_k_df <- index_k_df %>%
  left_join(dir_tbl2, by = "index")

#3.3 Compute each index's best k
summary_tbl <- index_k_df %>%
  group_by(index, direction, label) %>%
  summarise(
    k_opt = if (direction[1] == "lower") k[which.min(value)] else k[which.max(value)],
    index_opt = if (direction[1] == "lower") min(value) else max(value),
    .groups = "drop"
  )

#3.3.1 Determine consensus k via simple vote (ties resolve to the smallest k)
consensus_k <- summary_tbl$k_opt %>% table() %>% which.max() %>% names() %>% as.integer()

#3.3.2 Print and log the "consensus k"
cat(sprintf("%s|CONSENSUS_K|%s|%d\n", format(Sys.time(), "%Y-%m-%d %H:%M:%OS3"),
            scaling_method, consensus_k),
    file = log_file, append = TRUE)

#3.4 Create a line plot of all indices vs k
#3.4.1 Generate the line plot for each index x k value
index_line_plot <- ggplot(index_k_df, aes(x = k, y = value, color = label)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  geom_vline(xintercept = consensus_k, linetype = "dashed", color = "grey50") +
  annotate(
    "text", x = consensus_k + 0.2,
    y = max(index_k_df$value)*0.95,
    label = paste0("Consensus k = ", consensus_k),
    hjust = 0, size = 5) +
  labs(
    x = "Number of clusters (k)",
    y = "Validation-index value",
    color = "Index & direction",
    title = glue::glue("Validation-index trajectories ({scaling_method} scaling)")) +
  theme_minimal(base_size = 14)

#3.4.2 Print the line plot for each index x k value
print(index_line_plot)

#3.5 Create a heatmap of index values x each k
#3.5.1 Generate the heatmap
index_heatmap <- ggplot(index_k_df, aes(x = factor(k), y = label, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(option = "magma") +
  labs(
    x = "k",
    y = NULL,
    fill = "Value",
    title = "Heatmap of validation-index values") +
  theme_minimal(base_size = 14) +
  theme(axis.text.x = element_text(vjust = 0.5))

#3.5.2 Print the heatmap
print(index_heatmap)

#3.6 List the tabular summary of each index's optimal k & value
summary_tbl %>%
  arrange(label) %>%
  dplyr::select(label, k_opt, index_opt) %>%
  knitr::kable(
    caption = glue::glue("Optimal k per index ({scaling_method} scaling)"),
    col.names = c("Index (direction)", "optimal k", "Value"),
    digits = 3) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"))
Optimal k per index (z_score scaling)

| Index (direction)            | optimal k | Value |
|------------------------------|-----------|-------|
| Silhouette (higher = better) | 2         | 0.518 |

Interpretation: